We aim to develop a method to align and compare topics when
(1) the number of topics changes (varying \(K\)),
(2) the hyper-parameters of LDA change (e.g. varying \(\alpha\)), or
(3) different modalities [? is there a better term than “modality” ?] exist for the same documents. For example, the same set of documents exists in different languages but we don’t have a direct translation of each word. Or, as is more common in biology, the same samples have been profiled with different -omics assays (e.g. metagenomics or transcriptomics), and we want to compare the topics obtained from these different domains.
Models
We aim to compare the topics of \(M\) LDA models. Each specific model is denoted by \(m \in [1:M]\).
Topics
Each model \(m\) has \(K\) topics (\(K\) may differ between models). Each topic is denoted by \(k \in [1:K]\), or by \(k^m\) when the model it belongs to needs to be explicit.
Documents (Samples)
The dataset is composed of \(D\) documents (or samples). Each document/sample is denoted by \(d \in [1:D]\).
Words (features)
The dataset contains counts for a set of \(W\) words (or features). In biology, these features would be genes, transcripts, proteins, bacterial species, etc. Each word is denoted by an index \(w \in [1:W]\). The number of occurrences of word \(w\) in a specific document \(d\) is denoted by \(c_{w,d}\).
LDA model matrices
LDA models are defined by two matrices:
\(\beta\), which is a \(K \times W\) matrix where element \(\beta_{k,w}\) provides the proportion of word \(w\) in topic \(k\), and
\(\gamma\), which is a \(D \times K\) matrix where element \(\gamma_{d,k}\) provides the proportion of topic \(k\) in document \(d\).
In other words, an LDA model finds topics such that each document is optimally described as a mixture of topics (\(\gamma\)), each topic being itself characterized by a probability distribution over words (\(\beta\)).
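As a minimal sketch of these two matrices, the toy example below uses made-up numbers (not fitted values) for \(K = 2\) topics, \(W = 3\) words and \(D = 2\) documents. It only illustrates the shapes and the row-sum constraints, and shows that the product of \(\gamma\) and \(\beta\) gives the modeled word proportions of each document.

```r
# Toy LDA output: K = 2 topics, W = 3 words, D = 2 documents (made-up numbers).
beta <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.3, 0.6),
               nrow = 2, byrow = TRUE,
               dimnames = list(topic = c("a", "b"), word = c("w1", "w2", "w3")))
gamma <- matrix(c(0.9, 0.1,
                  0.2, 0.8),
                nrow = 2, byrow = TRUE,
                dimnames = list(doc = c("d1", "d2"), topic = c("a", "b")))

# Each topic is a distribution over words; each document is a distribution over topics.
stopifnot(all(abs(rowSums(beta) - 1) < 1e-12),
          all(abs(rowSums(gamma) - 1) < 1e-12))

# Modeled word proportions per document: a D x W matrix whose rows also sum to 1.
doc_word <- gamma %*% beta
```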
For objectives (1) and (2), we can align topics using the \(\beta\) matrices from each model \(m\), since all models are defined over the same words. For objective (3), the words differ between modalities, so only the \(\gamma\) matrices can be used to align topics.
We will thus first consider the problem of aligning topics using the \(\gamma\) matrices, then consider the “inverse” problem of aligning topics using the \(\beta\) matrices and discuss similarities and differences.
In both cases, in addition to aligning topics between successive models (e.g. successive values of \(K\) or \(\alpha\), or manually ordered modalities), we are also interested in computing and visualizing the alignment between each model and a reference model \(m^R\).
First, each document \(d\) is assigned a reference topic \(k^R_d\), defined as the topic of the reference model \(m^R\) with the largest proportion in this document: \(k^R_d = \arg \max_{k^R} \gamma_{d,k^R}\).
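The reference-topic assignment can be sketched in base R as follows. The matrix `gamma_ref` is a made-up \(\gamma\) matrix for a reference model with three topics, and the tie-breaking rule (first maximum) is an assumption, since the text does not specify one.

```r
# Toy gamma matrix of the reference model: D = 4 documents, 3 reference topics.
gamma_ref <- matrix(c(0.6, 0.3, 0.1,
                      0.2, 0.7, 0.1,
                      0.1, 0.1, 0.8,
                      0.5, 0.4, 0.1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("d", 1:4), c("a", "b", "c")))

# For each document, pick the reference topic with the largest proportion.
# ties.method = "first" makes the choice deterministic when proportions tie.
k_ref <- colnames(gamma_ref)[max.col(gamma_ref, ties.method = "first")]
names(k_ref) <- rownames(gamma_ref)
```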
We then compute the proportion of mass transferred between each topic of successive models as \(w^{\gamma}_{k^m, k^{m+1}} = \frac{1}{D} \sum_{d}^D \gamma_{d,k^m} \ \gamma_{d, k^{m+1}}\)
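If we want to split these weights by reference topic, we have \(w^{\gamma}_{k^m, k^{m+1}, k^R} = \frac{1}{D} \sum_{d}^{D^R} \gamma_{d,k^m} \ \gamma_{d, k^{m+1}}\), where the sum runs over \(D^R\), the documents whose reference topic is \(k^R\).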
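A minimal sketch of these weights on toy data is given below. It assumes that the split by reference topic restricts the sum to the documents sharing that reference topic while keeping the \(1/D\) normalization, so that the split weights add up to the unsplit ones. The matrices and the `k_ref` vector are made up for illustration.

```r
# Toy gamma matrices for two successive models over the same D = 4 documents.
gamma_m    <- matrix(c(1, 1, 1, 1), ncol = 1, dimnames = list(NULL, "a"))  # K = 1
gamma_next <- matrix(c(0.9, 0.1,
                       0.8, 0.2,
                       0.3, 0.7,
                       0.0, 1.0), ncol = 2, byrow = TRUE,
                     dimnames = list(NULL, c("a", "b")))                    # K = 2
k_ref <- c("x", "x", "y", "y")  # hypothetical reference topic of each document
D <- nrow(gamma_m)

# w[k, k'] = (1/D) * sum_d gamma_m[d, k] * gamma_next[d, k']
w <- crossprod(gamma_m, gamma_next) / D

# Same weights, restricted to the documents of each reference topic.
w_by_ref <- lapply(split(seq_len(D), k_ref), function(idx) {
  crossprod(gamma_m[idx, , drop = FALSE], gamma_next[idx, , drop = FALSE]) / D
})
```

Because each row of a \(\gamma\) matrix sums to 1, the entries of `w` sum to 1 in this sketch, and adding the matrices in `w_by_ref` recovers `w`.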
Consequently, the “height” of each topic \(h_{k^m}\) is \(h_{k^m} = \sum_d \gamma_{d,k^m}\). Topics that are the main topic of many documents have a larger “height” than topics that are secondary topics of many documents or the main topic of only a few documents.
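As a small illustration with made-up values, the heights are simply the column sums of the \(\gamma\) matrix of a model. In the toy matrix below, topic `a` is the main topic of three documents out of four and therefore gets the larger height.

```r
# Toy gamma matrix: D = 4 documents, K = 2 topics (made-up numbers).
gamma_m <- matrix(c(0.9, 0.1,
                    0.8, 0.2,
                    0.6, 0.4,
                    0.3, 0.7), ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("a", "b")))

# Topic height: sum of the topic proportions over all documents.
h <- colSums(gamma_m)
```

Since each row of \(\gamma\) sums to 1, the heights of all topics of a model add up to \(D\).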
To align topics based on the distribution of word probability in these topics, we first define the following concepts:
the average word frequency: \(f_w = \frac{1}{D} \sum_d^D f_{w,d}\) with \(f_{w,d} = \frac{c_{w,d}}{\sum_{w'}^W c_{w',d}}\)
the “topic height”: \(h_{k^m} = \sum_w^W f_w \ \beta_{k^m,w}\)
the “reference topic height” in each topic: \(h_{k^m, k^R} = \sum_w^{W^R} f_w \ \beta_{k^m,w}\)
[NOTE: these definitions are sufficient to draw the composition of each topic for each \(m\), but to draw the flow between the topics, we need to find the optimal mass transfer - I kept writing down my notes, but it’s not super useful and it’s not implemented, instead, I implemented something a little ugly for the visualization of the flows]
the “word height” in each topic: \(h_{w, k^m} = \beta_{k^m,w} \ h_{k^m}\)
the modeled “word height” over all topics: XXXX
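The quantities defined above can be sketched on toy data as follows. The counts and the \(\beta\) matrix are made up, and \(\beta\) is indexed as a \(K \times W\) matrix, as in its definition. The “reference topic height” is left out because it depends on the word set \(W^R\), which is not specified here.

```r
# Toy counts c[w, d]: W = 3 words (rows), D = 2 documents (columns).
counts <- matrix(c(8, 1,
                   1, 1,
                   1, 8), nrow = 3, byrow = TRUE,
                 dimnames = list(c("w1", "w2", "w3"), c("d1", "d2")))
# Toy K x W beta matrix (rows sum to 1).
beta <- matrix(c(0.8, 0.1, 0.1,
                 0.1, 0.1, 0.8), nrow = 2, byrow = TRUE,
               dimnames = list(c("a", "b"), rownames(counts)))

# f[w, d]: frequency of word w within document d; f_w: its average over documents.
f_wd <- sweep(counts, 2, colSums(counts), "/")
f_w  <- rowMeans(f_wd)

# Topic height: h_k = sum_w f_w * beta[k, w].
h_k <- drop(beta %*% f_w)

# Word height within each topic: h[w, k] = beta[k, w] * h_k (returned as W x K).
h_wk <- t(beta * h_k)
```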
We have implemented the methods described above in a series of functions which can be run sequentially:
run_lda_models, which runs the LDA models for a specified set of \(K\)s, \(\alpha\)s or modalities.
(optional) trim_LDA_models filters the \(\beta\) matrix to keep only \(N\) words (set by the user) or the words with a probability of at least \(p_{min}\) (set by the user) in any topic.
align_topics performs the topic alignment on the \(\gamma\) matrices and reorders the topics so that the most strongly aligned topics are placed close to each other [not sure how to formulate this]. Then, if possible, this function computes the alignment based on the \(\beta\) matrices [honestly, the solution with the \(\gamma\) is so elegant, easy and can be applied to any case, that I wonder if it’s even worth trying to do the alignment based on the \(\beta\) matrix. Let’s talk :)]. If a reference model is not provided, the last model is used as the reference.
visualize_aligned_topics computes the visualization layout and returns a ggplot object showing the alignment flow between topics.
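To make the \(p_{min}\) trimming rule concrete, here is a minimal sketch in base R. The helper `trim_beta_sketch` is hypothetical and is not the function shipped in align_topic_functions.R; it only shows one way to keep the words whose probability reaches \(p_{min}\) in at least one topic of a \(K \times W\) matrix.

```r
# Hypothetical helper (not the actual trimming function): keep the words whose
# probability reaches p_min in at least one topic of a K x W beta matrix.
trim_beta_sketch <- function(beta, p_min) {
  keep <- apply(beta, 2, max) >= p_min
  beta[, keep, drop = FALSE]
}

# Toy beta matrix: K = 2 topics, W = 4 words (made-up numbers).
beta <- matrix(c(0.70, 0.25, 0.04, 0.01,
                 0.02, 0.03, 0.90, 0.05), nrow = 2, byrow = TRUE,
               dimnames = list(c("a", "b"), paste0("w", 1:4)))

trimmed <- trim_beta_sketch(beta, p_min = 0.1)
```

Note that after trimming, the rows of the matrix no longer sum to 1; whether to renormalize them is a design choice that this sketch leaves open.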
Below is an example of how these functions are used on vaginal microbiome data.
# Libraries to attach
library(tidyverse)
library(magrittr)
library(topicmodels)
library(slam)
# load the topic alignment functions
source("align_topic_functions.R")
# viz default theme
theme_set(theme_minimal())
load(file = "vm_16s_data.Rdata", verbose = TRUE)
## Loading objects:
## vm_16s
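# Keep fields 6 to 8 of the space-separated ASV names as genus, species and strain
new_asv_names = colnames(vm_16s) %>%
  str_split_fixed(., " ", n = 8) %>%
  as.matrix() %>% .[, c(6, 7, 8)] %>%
  as.data.frame() %>%
  set_colnames(c("genus", "species", "strain")) %>%
  mutate(short_name =
           str_c(genus, " ",
                 species %>% str_replace(., "NA", "-"), " ",
                 strain)) %>%
  select(short_name) %>% unlist()
# Disambiguate duplicated names by appending a running index
# (seq_along() is safe when there are no duplicates, unlike 1:length(j))
j = which(duplicated(new_asv_names))
new_asv_names[j] = str_c(new_asv_names[j], " (", seq_along(j), ")")
colnames(vm_16s) = new_asv_names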
vm_16s <- slam::as.simple_triplet_matrix(vm_16s %>% round())
topic_models_dir = "lda_models/"
lda_models =
  run_lda_models(
    data = vm_16s,
    Ks = 1:13,
    method = "VEM",
    seed = 2,
    dir = topic_models_dir
  )
names(lda_models)
## [1] "betas" "gammas"
head(lda_models$betas)
## # A tibble: 6 x 5
## m K k_LDA w b
## <fct> <dbl> <chr> <chr> <dbl>
## 1 1 1 a Lactobacillus iners 1 0.297
## 2 1 1 a Lactobacillus crispatus 1 0.239
## 3 1 1 a Lactobacillus iners 2 0.0353
## 4 1 1 a Lactobacillus gasseri 1 0.0448
## 5 1 1 a Megasphaera - 1 0.0346
## 6 1 1 a Lactobacillus jensenii 1 0.0392
head(lda_models$gammas)
## # A tibble: 6 x 5
## m K k_LDA d g
## <fct> <dbl> <chr> <chr> <dbl>
## 1 1 1 a 1005601068 1
## 2 1 1 a 1005601078 1
## 3 1 1 a 1005601088 1
## 4 1 1 a 1005601098 1
## 5 1 1 a 1005601108 1
## 6 1 1 a 1005601118 1
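The long-format table above can be reshaped into the \(D \times K\) matrix \(\gamma\) of a single model when matrix operations are more convenient. The sketch below uses a made-up data frame with the same column names as `lda_models$gammas`; it is only an illustration and is not part of the alignment functions.

```r
# Toy long-format table mimicking the columns of lda_models$gammas.
gammas <- data.frame(
  m     = factor(c(2, 2, 2, 2)),
  K     = c(2, 2, 2, 2),
  k_LDA = c("a", "b", "a", "b"),
  d     = c("doc1", "doc1", "doc2", "doc2"),
  g     = c(0.9, 0.1, 0.3, 0.7)
)

# Select one model, then spread documents over rows and topics over columns.
one_model <- gammas[gammas$m == "2", ]
gamma_mat <- with(one_model, tapply(g, list(d, k_LDA), sum))
```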
aligned_topics =
  align_topics(
    data = vm_16s,  # the count matrix used to fit the models
    lda_models = lda_models
  )
names(aligned_topics)
## [1] "lda_models" "gamma_alignment" "topics_order"
head(aligned_topics$gamma_alignment)
## # A tibble: 6 x 10
## m m_next m_ref k_LDA k_LDA_next k_LDA_ref w k k_next k_ref
## <fct> <fct> <fct> <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 1 2 13 a a a 0.00639 1 1 3
## 2 1 2 13 a a b 0.0104 1 1 13
## 3 1 2 13 a a c 0.00110 1 1 10
## 4 1 2 13 a a d 0.0141 1 1 12
## 5 1 2 13 a a e 0.00467 1 1 4
## 6 1 2 13 a a f 0.0666 1 1 5
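Since the alignment table stores one weight per reference topic, the total flow between two topics of successive models can be recovered by summing over `k_ref`. The sketch below does this on a made-up data frame that reuses a subset of the column names shown above; the values are not taken from the real output.

```r
# Toy alignment table with a subset of the columns of gamma_alignment.
alignment <- data.frame(
  m      = c(1, 1, 1, 1),
  m_next = c(2, 2, 2, 2),
  k      = c(1, 1, 1, 1),
  k_next = c(1, 1, 2, 2),
  k_ref  = c(1, 2, 1, 2),
  w      = c(0.4, 0.1, 0.2, 0.3)
)

# Sum the weights over reference topics to get one flow per pair of topics.
flows <- aggregate(w ~ m + m_next + k + k_next, data = alignment, FUN = sum)
```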
# head(aligned_topics$beta_alignment) # not implemented
ggplot(aligned_topics$topics_order, aes(x = m, y = k, col = k_LDA)) +
  geom_text(aes(label = k_LDA)) + guides(col = "none")
g_aligned_topics =
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    add_leaves = TRUE,
    min_beta = 0.05,
    add_words_labels = TRUE
  )
g_aligned_topics
g_aligned_topics =
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    add_leaves = FALSE
  )
g_aligned_topics
g_aligned_topics_ref =
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    color_by = "reference",
    add_leaves = FALSE
  )
g_aligned_topics_ref
g_aligned_topics_ref =
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    color_by = "reference",
    add_leaves = TRUE
  )
g_aligned_topics_ref